Take Home Exercise 03

A Study on the Financial health of the residents og Engagement, Ohio, USA.

Rakendu Ramesh https://www.linkedin.com/in/rakendu-ramesh/ (Singapore Management University)https://www.linkedin.com/showcase/smumitb/
2022-05-16

Overview

In this Take Home exercise, we strive to answer the following questions about the financial health of the residents of Engagement, Ohio, USA.

Dataset

The dataset used is available for download here. Out of the data available, we will be using the Financial Journal to understand the income and expense of the participants over a period of time.

Getting Started

We will first load the required packages using the below code chunk

packages = c('tidyverse','ggdist','gghalves','ggthemes','hrbrthemes','ggridges','patchwork','zoo', 'ggrepel','ggiraph','lubridate','gganimate','scales')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Importing Data

The code chunk below imports FinancialJournal.csv and Participants.csv from the data folder by using read_csv() function of readr package into R and save it as a tibble dataframe called financial_data.

financial_data <- read_csv("data/FinancialJournal.csv")
participants_data <- read_csv("data/Participants.csv")

Data Wrangling

In the below code chunk, financial data of participants are grouped by Month and Aggregated to find their monthly Income and Expenses. We will also calcualte the savings of the participants. The data is then joined with Participants data so that we have information about participants education level and other details.

level_order <- c('Graduate','Bachelors','HighSchoolOrCollege','Low')

  participant_fin <- financial_data %>%
  mutate(date = as.yearmon(timestamp)) %>%
  group_by(participantId, date) %>%
  summarise(income = sum(ifelse(amount > 0,amount,0)), expense = sum(ifelse(amount <= 0,amount,0))) %>%
  mutate(savings = round(income + expense,digits = 0)) %>%
  inner_join(participants_data, by =c('participantId'))%>%
  mutate(educationLevel = factor(educationLevel, levels = level_order))

We will further wrangle the data to find the average income of participants with different Education Levels.

Education_fin <- participant_fin %>%
  group_by(educationLevel,date) %>%
  summarise(income = mean(income), expense= mean(expense))

Visualising the Income of Participants to understand if the Education Level has any impact

From the below plot, we can see that the participants with Low education have lower Income where as participants with Graduate and Bachelor level education has higher Income. It is observed that there is a major dip in Income in April, 2022 and Feb, 2023 where as the income is high in March,2022, Aug,2022 and March 2023.

participant_fin %>%
  ggplot(aes(x=date, y = income, group =participantId, color =educationLevel))+
  geom_line_interactive(size = 0.6)+
  ylab("Income")+
  xlab("Month, Year")+
  theme(axis.title.y=element_text(angle =0, margin = margin(t=-50,r=-20)),
        axis.title.x=element_text(margin = margin(t=-10)),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        panel.grid.major.y = element_line(colour = "grey90"),
        panel.background = element_rect(fill = "white"),
        axis.text.x = element_text(size =16, angle = 45, margin = margin(t = 30)),
        axis.text.y = element_text(size =16),
        axis.line = element_line(color="grey25", size = 0.02),
        axis.title = element_text(size=16),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))+
  ggtitle("Income of Participants by Education Level")

Making the Line Graph interactive by adding Tooltip

Since there are too many lines, let us add a tooltip to display the income and education level on hover

participant_fin$tooltip <- c(paste0(
  "Id = ", participant_fin$participantId,
  "\n Income = $", round(participant_fin$income,digits = 0),
  "\n Education :",participant_fin$educationLevel))

p1 <- participant_fin %>%
  ggplot(aes(x=date, y = income, group =participantId, color =educationLevel))+
  geom_line_interactive(aes(tooltip = tooltip),size =0.4)+
  ylab("Income")+
  xlab("Month, Year")+
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        panel.grid.major.y = element_line(colour = "grey90"),
        panel.background = element_rect(fill = "white"),
        axis.text.x = element_text(size =16, angle = 45, margin = margin(t = 30,r=30)),
        axis.text.y = element_text(size =16),
        axis.line = element_line(color="grey25", size = 0.2),
        axis.title = element_text(size=16),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))+
  ggtitle("Income of Participants by Education Level")

girafe(
  ggobj = p1,
  width_svg = 12,
  height_svg = 12*0.618
)

Comparing the average income of different education groups

Let us clean up this graph to see the average Income by Education Level

Education_fin %>%
  ggplot(aes(x=date, y = income,col =educationLevel))+
  geom_line(size = 0.75)+
  ylab("Income")+
  xlab("Month, Year")+
  theme(axis.title.y=element_text(angle =0),
        axis.title.x=element_text(margin = margin(t=-10)),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        panel.grid.major.y = element_line(colour = "grey90"),
        panel.background = element_rect(fill = "white"),
        axis.text.x = element_text(size =16, angle = 45, margin = margin(t = 30)),
        axis.text.y = element_text(size =16),
        axis.line = element_line(color="grey25", size = 0.02),
        axis.title = element_text(size=16),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))+
  ggtitle("Average Income by Education Level")

Comparing the income and expense of different Education Levels

Let us now try to visualize both income and expense of different Education Levels on the same graph

Education_fin %>%
  ggplot(aes(x=date))+
  geom_line(aes(y= income,col =educationLevel),size = 0.75)+
  geom_line(aes(y = abs(expense), col =educationLevel ),size = 0.75,linetype = 2)+
  scale_color_manual('Education Level', values = c('blue','green','red','black'))+
  scale_linetype_manual(name='Income/Expense',values = c(1,2), labels = c('income','expense'))+
  scale_y_continuous(name = "Income",
                     sec.axis = sec_axis(trans = ~.,name="Expense"))+
  xlab("Month, Year")+
  theme(axis.title.y=element_text(angle =0),
        axis.title.x=element_text(margin = margin(t=-10)),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        panel.grid.major.y = element_line(colour = "grey90"),
        panel.background = element_rect(fill = "white"),
        axis.text.x = element_text(size =16, angle = 45, margin = margin(t = 30)),
        axis.text.y = element_text(size =16),
        axis.line = element_line(color="grey25", size = 0.02),
        axis.title = element_text(size=16),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))+
  ggtitle("Average Income and Expense by Education Level")

One challenge faced for this graph is to add a second legend while using dual y axis. Also, the x axis tick labels does not show all the levels.

Using Scatterplot to compare the Income and Expense in April, 2022

From the above plot, we can see that the expenses(dotted lines) are more or less same for different education levels and there is not much variation. This means that the savings will be proportional to the income and we will analyse this using scatterplot which is a better choice for continuous data.

For this we will add a column for savings to the dataset which is the difference between the income and expense.

participant_fin$tooltip <- c(paste0(
  "Id = ", participant_fin$participantId,
  "\n Savings = $",participant_fin$savings))

p2 <- participant_fin %>%
  filter(date == 'Apr 2022') %>%
  ggplot(aes(x=income, y = abs(expense), size = savings, color = educationLevel))+
  geom_point_interactive(aes(tooltip = tooltip), alpha=0.7) +
  ggtitle("Income vs Expense by different Education Levels") +
  ylab("Expense") +
  xlab("Income")+
  theme_minimal() +
  theme(axis.line = element_line(size = 0.5),
        axis.text = element_text(size = 16),
        axis.title = element_text(size=16),
        axis.title.y = element_text(angle = 0),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))

girafe(
  ggobj = p2,
  width_svg = 16,
  height_svg = 16*0.618
)

Observing the variation in the Income and Expense of the participants over the time period

participant_fin %>%
  filter(date >= 'Apr 2022') %>%
  transform(date = as.Date(date, frac = 1)) %>%
  ggplot(aes(x=income, y = abs(expense), size = savings, color = educationLevel))+
  geom_point(alpha=0.7) +
  ggtitle("Income vs Expense by different Education Levels") +
  ylab("Expense") +
  xlab("Income")+
  theme_minimal() +
  theme(axis.line = element_line(size = 0.5),
        axis.text = element_text(size = 16),
        axis.title = element_text(size=16),
        axis.title.y = element_text(angle = 0),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))+
  labs(title ='Period : {frame_time}')+
  transition_time(date)+
  ease_aes('linear')

Understanding the income variation between 2022 and 2023 using Cyclic plots

We have the income details of the participants for March, April and May for both the years 2022 and 2023. Let us analyse the monthly income variations of different education levels using cyclic graph.

Education_fin$month <- factor(month(Education_fin$date), 
                    levels=1:12, 
                    labels=month.abb, 
                    ordered=TRUE)
Education_fin$year <- format(Education_fin$date,"%Y")

hline.data <- Education_fin %>%
  filter( month %in% c('Mar','Apr','May')) %>%
  group_by(month) %>%
  summarise(avgvalue =mean(income))


Education_fin %>%
    filter( month %in% c('Mar','Apr','May')) %>%
  ggplot()+
  geom_line(aes(x = year,
                 y = income,
                group = educationLevel,
            colour=educationLevel),
            size = 1)+
  geom_hline(aes(yintercept = avgvalue),
             data=hline.data,
             linetype=6,
             colour="red",
             size =0.5)+
  facet_grid(~month)+
  labs(axis.text.x = element_blank(),
       title = "Income of participants in 2022 and 2023")

From the above plot, it can be seen that there is major drop in income of all education segments in the month of March in 2023 compared to 2022. The income of April is comparable between the years and a comparatively higher drop for the month of May. It is also observed that the income drop is highest for Education Levels Graduate and Bachelors.

Heatmap to understand the Expense pattern of Engagement Ohio over the days of a week

wkday_levels <- c('Saturday', 'Friday', 
                  'Thursday', 'Wednesday', 
                  'Tuesday', 'Monday', 
                  'Sunday')

expense_data <- financial_data %>%
  mutate(date = as.Date(timestamp,"%m/%d/%Y")) %>%
  group_by(date)%>%
  summarise(expense = round(abs(sum(ifelse(amount <= 0,amount,0))),digits=0))%>%
  mutate(wkday = factor(weekdays(date),levels = wkday_levels)) %>%
  mutate(wknumber = week(date))

expense_data%>%
  ggplot(aes(wknumber,wkday,fill=expense))+
  geom_tile(color = "white", size = 0.1)+
  theme_tufte(base_family = "Helvetica")+
  coord_equal() +
  scale_fill_gradient(name = "Expense",
                    low = "sky blue", 
                    high = "dark blue",
                    labels = comma)+
  labs(x = "Weeks of the Year",
       y = NULL,
       title = "Expense of Residents") +
  theme(axis.text = element_text(size = 16,margin = margin(r = -60)),
        axis.ticks.y= element_blank(),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        axis.title.x = element_text(size=14))

From the above heatmap, it seems that there is a pattern in the days with high spending by the residents of the city. But it is not evident when split by weekdays. Let us see if the high expense is on a particular day of the month.

month_levels <- c('Dec','Nov','Sep','Aug','Jul', 'Jun', 
                  'May', 'Apr', 
                  'Mar', 'Feb', 
                  'Jan')

expense_month_data <- financial_data %>%
  mutate(date = as.Date(timestamp,"%m/%d/%Y")) %>%
  filter(date < '2023-03-01') %>%
  group_by(date)%>%
  summarise(expense = round(abs(sum(ifelse(amount <= 0,amount,0))),digits=0))%>%
  mutate(month = factor(as.yearmon(date))) %>%
  mutate(day = day(date))

expense_month_data%>%
  ggplot(aes(day,month,fill=expense))+
  geom_tile(color = "white", size = 0.1)+
  theme_tufte(base_family = "Helvetica")+
  coord_equal() +
  scale_fill_gradient(name = "Expense",
                    low = "sky blue", 
                    high = "dark blue",
                    labels = comma)+
  labs(x = NULL, 
     y = NULL, 
     title = "Expense of Residents") +
  theme(axis.text = element_text(size = 9,margin = margin(r = -60)),
        axis.ticks.y= element_blank(),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16))

From the above heatmap, it is seen that the expense is much higher on the 1st day of every month. This stops us seeing any pattern in the rest of the data. Let us try to see the pattern by removing the first day of the month

month_levels <- c('Dec','Nov','Sep','Aug','Jul', 'Jun', 
                  'May', 'Apr', 
                  'Mar', 'Feb', 
                  'Jan')

expense_part_month_data <- financial_data %>%
  mutate(date = as.Date(timestamp,"%m/%d/%Y")) %>%
  group_by(date)%>%
  summarise(expense = round(abs(sum(ifelse(amount <= 0,amount,0))),digits=0))%>%
  mutate(month = factor(as.yearmon(date))) %>%
  mutate(day = day(date))%>%
  filter(day > 1)

expense_part_month_data$tooltip <- weekdays(expense_part_month_data$date)

p3 <- expense_part_month_data%>%
  ggplot(aes(day,month,fill=expense))+
  geom_tile_interactive(aes(tooltip = tooltip),color = "white", size = 0.1)+
  theme_tufte(base_family = "Helvetica")+
  coord_equal() +
  scale_fill_gradient(name = "Expense",
                    low = "sky blue", 
                    high = "dark blue",
                    labels = comma)+
  labs(x = "Day of the month", 
     y = NULL, 
     title = "Expense of Residents") +
  theme(axis.text = element_text(size = 12,margin = margin(r = -60)),
        axis.ticks.y= element_blank(),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =18),
        axis.title.x = element_text(size = 14))

girafe(
  ggobj = p3,
  width_svg = 12,
  height_svg = 12*0.618
)

Once we removed the first day of the month, which has the highest expenditure, now we can see a pattern where the expense is higher on 2 consecutive days. From the tooltips, it can be seen that the residents spend more on saturdays and sundays.

Conclusion

From the above plots, we can infer the following about the financial health of the residents of Engagement, Ohio, USA.